Method for three-dimensional human pose estimation

ABSTRACT

The invention discloses a method for three-dimensional (3D) human pose estimation that achieves real-time, high-precision 3D human pose estimation without high-end hardware support or a precise human body model. The method includes the following steps: (1) establishing a three-dimensional human body model matching the object, the model being a point cloud human body model with a visible spherical distribution constraint; (2) matching and optimizing between the human body model and the depth point cloud for human body pose tracking; and (3) recovering from pose tracking errors based on dynamic database retrieval.

CROSS REFERENCE TO THE RELATED APPLICATIONS

This application is based upon and claims priority to Chinese Patent Application No. CN 201910201559.3, filed on Mar. 18, 2019, the entire contents of which are incorporated herein by reference.

TECHNICAL FIELD

The present invention relates to the technical field of computer vision and pattern recognition, and particularly to a method for three-dimensional human pose estimation.

BACKGROUND OF THE INVENTION

Three-dimensional (3D) human pose estimation based on computer vision technology has been widely used in many fields, such as computer animation, medicine and human-computer interaction. Since the introduction of low-cost RGB-D sensors (such as Kinect), depth images have made it possible to largely avoid the data defects that complex backgrounds and changing light conditions cause in RGB visual information. The performance of 3D human pose estimation is therefore clearly improved by using depth information, which has become a current research hotspot. At present, many methods of 3D human pose estimation based on depth data achieve good recognition results, but further improvement of recognition accuracy still has to overcome two serious inherent defects of the depth data acquired by sensors: noise and occlusion.

There are two kinds of methods for 3D human pose estimation based on depth information: discriminative methods and generative methods. The former rely on a large amount of training data and can therefore adapt to different body types, but most of them cannot reach high precision for complex motion. The latter usually depend on a complex and accurate human body model and can therefore reach high precision when data is missing, but for fast and complex motion they easily fall into a local optimum and lose the global optimum. High-performance 3D human pose estimation methods thus often depend on the following: 1) a large, accurate training data set; 2) a huge pose database for tracking error recovery; 3) GPU acceleration support; 4) an accurate 3D human model. These requirements limit real-time human-computer interaction applications on platforms with ordinary hardware configurations.

SUMMARY

The technical problem addressed by the present invention is to overcome the deficiencies of the prior art and to provide a method for three-dimensional human pose estimation that achieves real-time, high-precision 3D human pose estimation without high-end hardware support or a precise human body model.

The technical solution of the present invention is a method for three-dimensional human pose estimation including the following steps:

-   (1) establishing a three-dimensional human body model matching the object, the model being a point cloud human body model with a visible spherical distribution constraint;
-   (2) matching and optimizing between the human body model and the depth point cloud for human body pose tracking;
-   (3) recovering from pose tracking errors based on dynamic database retrieval.

The invention takes a depth map sequence as input and optimizes the matching between the established 3D human body model and the 3D point cloud transformed from the depth map. The optimization process combines a global translation transformation with local rotation transformations, and uses a dynamic database to recover the pose when a tracking error occurs. This yields fast and accurate pose tracking, and the estimated positions of the joint points are obtained from the human body model. Real-time, high-precision three-dimensional human pose estimation can therefore be realized without high-end hardware support or an accurate human body model.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows the human body model represented by the sphere set and by the spherical point set. FIG. 1a shows the sphere-set representation of the human body model and the division into parts, and FIG. 1b shows the surface sampling of the sphere set.

FIG. 2 shows the naming of the 11 parts of the human body and the division of part parent nodes. FIG. 2a shows the division and naming of the 11 parts, and FIG. 2b shows the division of part parent nodes.

FIG. 3 shows the representation of human body direction characteristic.

FIG. 4 shows the minimum bounding box construction based on the PCA main directions.

FIG. 5 shows the average error of the SMMC dataset.

FIG. 6 shows the PDT dataset mean error.

FIG. 7 shows the subjective effect display on the PDT database.

FIG. 8 shows a flow chart of the three-dimensional human pose estimation method according to the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

As shown in FIG. 8, the method for three-dimensional human pose estimation includes the following steps:

-   (1) establishing a three-dimensional human body model matching the object, the model being a point cloud human body model with a visible spherical distribution constraint;
-   (2) matching and optimizing between the human body model and the depth point cloud for human body pose tracking;
-   (3) recovering from pose tracking errors based on dynamic database retrieval.

The invention takes a depth map sequence as input and optimizes the matching between the established 3D human body model and the 3D point cloud transformed from the depth map. The optimization process combines a global translation transformation with local rotation transformations, and uses a dynamic database to recover the pose when a tracking error occurs. This yields fast and accurate pose tracking, and the estimated positions of the joint points are obtained from the human body model. Real-time, high-precision three-dimensional human pose estimation can therefore be realized without high-end hardware support or an accurate human body model.

Preferably, in step (1):

The human body surface is represented by a set of 57 spheres. Each sphere is characterized by a radius and a center, which are initialized empirically. By assigning all the spheres to 11 body components, we define the sphere set S as the collection of 11 component sphere-set models, each of which represents a body component. That is,

$\begin{matrix}{{S = {\underset{k = 1}{\bigcup\limits^{11}}S^{k}}}{S^{k} = {\left\{ g_{i}^{k} \right\}_{i = 1}^{N_{k}}:=\left\{ \left\lbrack {c_{i}^{k},r_{i}^{k}} \right\rbrack \right\}_{i = 1}^{N_{k}}}}} & (1)\end{matrix}$

where c_(i) ^(k) and r_(i) ^(k) represent the center and the radius of the ith sphere in the kth component, respectively, and N_(k) represents the number of spheres contained in the kth component, with

${\sum\limits_{k = 1}^{11}\; N_{k}} = 57.$

Preferably, in step (1), wrist and ankle movements are ignored.

Preferably, in step (1), for all 57 spheres, we construct a directed tree, each node of which corresponds to a sphere. The root of the tree is g₁ ¹, and each of the other nodes has a unique parent node, which is denoted by a black sphere. The parent nodes are defined by:

$$\begin{aligned}&\mathrm{parent}(S^{1}) = g_{1}^{1},\ \mathrm{parent}(S^{2}) = g_{1}^{1},\ \mathrm{parent}(S^{3}) = g_{2}^{2},\ \mathrm{parent}(S^{4}) = g_{1}^{3},\\&\mathrm{parent}(S^{5}) = g_{1}^{2},\ \mathrm{parent}(S^{6}) = g_{1}^{5},\ \mathrm{parent}(S^{7}) = g_{2}^{2},\ \mathrm{parent}(S^{8}) = g_{2}^{1},\\&\mathrm{parent}(S^{9}) = g_{1}^{8},\ \mathrm{parent}(S^{10}) = g_{2}^{1},\ \mathrm{parent}(S^{11}) = g_{1}^{10}\end{aligned} \qquad (2)$$

Based on this definition, the motion of each body part is considered to be determined by the rotation R_(k) in the local coordinate system whose origin is its parent node, plus the global translation vector t in the world coordinate system. A spherical point cloud is obtained by dense sampling with the Fibonacci sphere algorithm, and the point cloud human body model with a visible spherical distribution constraint is given by formula (3):

$$\begin{aligned}V &= \bigcup_{k=1}^{11} V^{k} := \bigcup_{k=1}^{11}\bigcup_{i=1}^{N_{k}}\bigcup_{j=1}^{Q_{k,i}} \left\{ c_{i}^{k} + r_{i}^{k} d_{k,i}^{j} \right\}\\ d_{k,i}^{j} &= \left[ x_{k,i}^{j}, y_{k,i}^{j}, z_{k,i}^{j} \right]^{T}\\ x_{k,i}^{j} &= \sqrt{1 - \left( z_{k,i}^{j} \right)^{2}} \cdot \cos\left( 2\pi j \varphi \right)\\ y_{k,i}^{j} &= \sqrt{1 - \left( z_{k,i}^{j} \right)^{2}} \cdot \sin\left( 2\pi j \varphi \right)\\ z_{k,i}^{j} &= \left( 2j - 1 \right)/N_{i} - 1\end{aligned} \qquad (3)$$

where Q_(k,i) denotes the number of sampling points of the ith sphere of the kth component, ϕ≈0.618 is the golden section ratio, and d_(k,i) ^(j) denotes the direction vector of the jth sampling point of the ith sphere of the kth component. Each point is assigned a visibility attribute, which is determined by the observation coordinate system of the point cloud; visibility detection decides whether each point is visible. The point set consisting of all visible sphere-surface points is used to represent the human body model. This is the point cloud human body model with a visible spherical distribution constraint.

Preferably, in step (2), the depth point cloud P transformed from the depth map is sampled to obtain $\overline{P}$. Assuming that the model and the depth point cloud are in the same camera coordinate system, the camera corresponding to the depth point cloud is used to constrain the angle of view, and the intersecting and occluded parts are removed to retain the visible points $\overline{V}$ on the model under the current angle of view. These points represent the model in the current pose. A Euclidean distance measure is used to find, for each point of $\overline{P}$, the corresponding point on $\overline{V}$; the set of corresponding points redefines the model as $\overline{\overline{V}}$:

$\begin{matrix}{{\overset{\_}{\overset{\_}{V}} = {\underset{k = 1}{\bigcup\limits^{11}}{\underset{i = 1}{\bigcup\limits^{N_{k}}}{\underset{j = 1}{\bigcup\limits^{Q_{k,i}}}\left\{ {c_{i}^{k} + {r_{i}^{k}d_{k,i}^{j}}} \right\}}}}}{\overset{\_}{P} = {\underset{k = 1}{\bigcup\limits^{11}}{\underset{i = 1}{\bigcup\limits^{N_{k}}}{\underset{j = 1}{\bigcup\limits^{Q_{k,i}}}{{\overset{\_}{p}}_{k,i}^{j}.}}}}}} & (4)\end{matrix}$

Preferably, in step (2), after the correspondence between $\overline{P}$ and $\overline{\overline{V}}$ is established, the movement of the human body is regarded as a series of simultaneous and slow movements of all body components. The matching and optimization of the model and the point cloud is therefore converted into solving for a global translation t and a series of local rotations R_(k). The cost function is formula (5):

$$\min_{R_{k},\,t} \sum_{k=1}^{11} \left( \lambda\,\Psi_{corr}\left( R_{k},t \right) + \Psi_{joint}\left( R_{k},t \right) + \mu_{k}\,\Psi_{regu}\left( R_{k} \right) \right) \quad \text{s.t.}\ \left( R_{k} \right)^{T} R_{k} = I \qquad (5)$$

where λ, μ_(k)>0 are weight parameters. The first term Ψ_(corr) penalizes the distance between the model surface points and the input depth point cloud:

$$\Psi_{corr}\left( R_{k},t \right) = \sum_{i=1}^{N_{k}} \sum_{j=1}^{\overline{Q}_{k,i}} \Big\| \underbrace{c_{parent}^{k} + R_{k}\left( c_{i}^{k} - c_{parent}^{k} + r_{i}^{k} d_{k,i}^{j} \right) + t}_{\text{points of VHM after rotation and translation}} - \overline{p}_{k,i}^{j} \Big\|^{2}$$

where c_(parent) ^(k) represents the center coordinate of the parent node of the kth component. Under this constraint, each point of the model is forced closer to its corresponding point in the point cloud after rotation and translation. The second term Ψ_(joint), formula (6), uses the joint positions and joint directions of the previous frame as special marker information to restrict excessive spatial movement and rotation between two frames, and to reduce the difference between the two frames to a certain extent:

$$\Psi_{joint}\left( R_{k},t \right) = \sum_{m=1}^{M_{k}} \left( \alpha_{k,m} \left\| j_{k,m} + t - j_{k,m}^{init} \right\|^{2} + \beta_{k,m} \left\| R_{k} n_{k,m} - n_{k,m}^{init} \right\|^{2} \right) \qquad (6)$$

where j_(k,m) and j_(k,m) ^(init) represent the position of the mth joint of the kth component under the current pose and the initial pose, respectively, and n_(k,m) and n_(k,m) ^(init) represent the direction between the mth joint and its parent joint under the current pose and the initial pose, respectively. The weights α_(k,m), β_(k,m), which balance the correspondence term against the joint location term, are given by formula (7):

$$\alpha_{k,m} = \frac{\tau^{k}}{1 + e^{-\left( \left\| j_{k,m} - j_{k,m}^{init} \right\| - \omega_{2} \right)}}, \qquad \beta_{k,m} = \frac{\gamma^{k}}{1 + e^{-\left( \arccos\left( n_{k,m}^{T} n_{k,m}^{init} \right) - \omega_{3} \right)}} \qquad (7)$$

where ω₂, ω₃>0 are weight parameters for controlling the range of error, and τ^(k), γ^(k) are scaling parameters defined by:

$$\tau^{k} = \frac{\mu_{1}}{1 + e^{\left( Dist\left( \overline{P}^{k},\overline{\overline{V}}^{k} \right) - \omega_{1} \right)}}, \qquad \gamma^{k} = \frac{\mu_{2}}{1 + e^{\left( Dist\left( \overline{P}^{k},\overline{\overline{V}}^{k} \right) - \omega_{1} \right)}}, \qquad Dist\left( \overline{P}^{k},\overline{\overline{V}}^{k} \right) = \frac{1}{\left| \overline{P}^{k} \right|} \sum_{i=1}^{N_{k}} \sum_{j=1}^{Q_{k,i}} \left\| c_{i}^{k} + r_{i}^{k} d_{k,i}^{j} - \overline{p}_{k,i}^{j} \right\| \qquad (8)$$

where Dist($\overline{P}^{k}$, $\overline{\overline{V}}^{k}$) represents the average distance between corresponding points of $\overline{P}^{k}$ and $\overline{\overline{V}}^{k}$, and ω₁>0 determines the distance error threshold. τ^(k) and γ^(k) are solved only once, before optimization and after the first correspondence is determined, and remain unchanged during the iterations. α_(k,m) and β_(k,m) are updated whenever the correspondence is updated.

The third term Ψ_(regu), formula (9), constrains large rotations of each part during the iterations, since the motion between two adjacent frames is regarded as a process in which all parts change simultaneously:

$$\Psi_{regu}\left( R_{k} \right) = \left\| R_{k} - I \right\|^{2} \qquad (9)$$

Preferably, in step (3), the overlap rate θ_(overlap) between the input depth point cloud and the constructed human body model on the two-dimensional plane, together with the cost function value θ_(cost), is used to determine whether the current tracking has failed. Assuming that human limb motion segments are repetitive over time, the direction information of each body part is used to represent the three-dimensional human motion: the upper and lower trunk parts are simplified into two mutually perpendicular main directions, each limb part is represented by a direction vector, and the direction of the head is ignored. This is expressed as formula (10):

$$v = \left( v_{1}^{T}, \ldots, v_{10}^{T} \right)^{T} \qquad (10)$$

where v₁ and v₂ are the mutually perpendicular unit directions of the upper torso and the lower torso, respectively, and v₃, . . . , v₁₀ are the unit directions of all components other than the upper torso, the lower torso and the head.

Preferably, in step (3), PCA is used to extract the main directions [e₁, e₂, e₃] of the depth point cloud, and the minimum bounding box [w,d,h] along the main directions is used to represent the characteristics of the depth point cloud, which is formula (11):

$$e = \left( w e_{1}^{T}, d e_{2}^{T}, h e_{3}^{T} \right)^{T} \qquad (11)$$

If, during tracking, the matching satisfies the thresholds θ_(overlap)≤θ₁ and θ_(cost)≤θ₂, the tracking is regarded as successful and the database model D is updated with the extracted features [e,v], which are saved in the database as a pair of feature vectors. When the tracking fails, the Euclidean distance between the feature e of the current depth point cloud and the features e stored in the database is calculated, the five entries {[e^((i)), v^((i))]}_(i=1) ⁵ with the smallest distance are found in the database, and the pose with the highest overlap rate with the current input depth point cloud is retrieved by using v^((i)), i=1, . . . , 5, to recover the point cloud human body model with a visible spherical distribution constraint and thereby recover from the tracking failure.

The invention is described in more detail below.

The invention takes a depth map sequence as input and optimizes the matching between the established 3D human body model and the 3D point cloud transformed from the depth map. The optimization process combines a global translation transformation with local rotation transformations, and uses a dynamic database to recover the pose when a tracking error occurs. This yields fast and accurate pose tracking, and the estimated joint positions are obtained from the human body model. The invention mainly includes three key technical points: (1) establishing a three-dimensional human body model matching the object, which combines the advantages of a geometric model and a mesh model; (2) on the basis of this model, once the correspondence between the human body model and the depth point cloud is determined, transforming the matching optimization problem between the human body model and the point cloud into solving for a global translation transformation matrix and local rotation transformation matrices; (3) building a small dynamic database used to reinitialize tracking in case of failure.

1. A point cloud human body model with a visible spherical distribution constraint:

As shown in FIG. 1a, the human body surface is represented by a set of 57 spheres. Each sphere is characterized by a radius and a center, which are initialized empirically. As shown in FIG. 2a, by assigning all the spheres to 11 body components, we define the sphere set S as the collection of 11 component sphere-set models, each of which represents a body component. That is,

$\begin{matrix}{{S = {\underset{k = 1}{\bigcup\limits^{11}}S^{k}}}{S^{k} = {\left\{ g_{i}^{k} \right\}_{i = 1}^{N_{k}}:=\left\{ \left\lbrack {c_{i}^{k},r_{i}^{k}} \right\rbrack \right\}_{i = 1}^{N_{k}}}}} & (1)\end{matrix}$

where c_(i) ^(k) and r_(i) ^(k) represent the center and the radius of the ith sphere in the kth component, respectively, and N_(k) represents the number of spheres contained in the kth component, with

${\sum\limits_{k = 1}^{11}\; N_{k}} = 57.$

For simplification, wrist and ankle movements are ignored.

For all 57 spheres, we construct a directed tree, each node of which corresponds to a sphere, as shown in FIG. 2b. The root of the tree is g₁ ¹, and each of the other nodes has a unique parent node, which is denoted by a black sphere. The parent nodes are defined by:

$$\begin{aligned}&\mathrm{parent}(S^{1}) = g_{1}^{1},\ \mathrm{parent}(S^{2}) = g_{1}^{1},\ \mathrm{parent}(S^{3}) = g_{2}^{2},\ \mathrm{parent}(S^{4}) = g_{1}^{3},\\&\mathrm{parent}(S^{5}) = g_{1}^{2},\ \mathrm{parent}(S^{6}) = g_{1}^{5},\ \mathrm{parent}(S^{7}) = g_{2}^{2},\ \mathrm{parent}(S^{8}) = g_{2}^{1},\\&\mathrm{parent}(S^{9}) = g_{1}^{8},\ \mathrm{parent}(S^{10}) = g_{2}^{1},\ \mathrm{parent}(S^{11}) = g_{1}^{10}\end{aligned} \qquad (2)$$
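The parent relation of formula (2) can be held in a small lookup table. The following is a minimal sketch, not the implementation of the invention: the names `PARENT` and `chain_to_root` are hypothetical, each entry is written as (component of the parent sphere, index of the parent sphere within that component), and the entries for S³ and S⁸ are read as g₂² and g₂¹, as they appear in the claims.

```python
# Hypothetical lookup table for formula (2): parent(S^k) = g_i^j is stored as
# PARENT[k] = (j, i), i.e. (component of the parent sphere, sphere index).
PARENT = {
    1: (1, 1), 2: (1, 1), 3: (2, 2), 4: (3, 1), 5: (2, 1), 6: (5, 1),
    7: (2, 2), 8: (1, 2), 9: (8, 1), 10: (1, 2), 11: (10, 1),
}


def chain_to_root(k):
    """Return the components visited when walking from component k to the root."""
    chain = [k]
    while chain[-1] != 1:  # component 1 holds the root sphere g_1^1
        chain.append(PARENT[chain[-1]][0])
    return chain


if __name__ == "__main__":
    for k in sorted(PARENT):
        print(k, chain_to_root(k))
```

Walking the chain in reverse gives an order in which parent components can be processed before their children.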

Based on this definition, the motion of each body part is considered to be determined by the rotation R_(k) in the local coordinate system whose origin is its parent node, plus the global translation vector t in the world coordinate system. FIG. 1b shows the surface sampling of the sphere set. A spherical point cloud is obtained by dense sampling with the Fibonacci sphere algorithm, and the point cloud human body model with a visible spherical distribution constraint is given by formula (3):

$$\begin{aligned}V &= \bigcup_{k=1}^{11} V^{k} := \bigcup_{k=1}^{11}\bigcup_{i=1}^{N_{k}}\bigcup_{j=1}^{Q_{k,i}} \left\{ c_{i}^{k} + r_{i}^{k} d_{k,i}^{j} \right\}\\ d_{k,i}^{j} &= \left[ x_{k,i}^{j}, y_{k,i}^{j}, z_{k,i}^{j} \right]^{T}\\ x_{k,i}^{j} &= \sqrt{1 - \left( z_{k,i}^{j} \right)^{2}} \cdot \cos\left( 2\pi j \varphi \right)\\ y_{k,i}^{j} &= \sqrt{1 - \left( z_{k,i}^{j} \right)^{2}} \cdot \sin\left( 2\pi j \varphi \right)\\ z_{k,i}^{j} &= \left( 2j - 1 \right)/N_{i} - 1\end{aligned} \qquad (3)$$

where Q_(k,i) denotes the number of sampling points of the ith sphere of the kth component, ϕ≈0.618 is the golden section ratio, and d_(k,i) ^(j) denotes the direction vector of the jth sampling point of the ith sphere of the kth component. Each point is assigned a visibility attribute, which is determined by the observation coordinate system of the point cloud; visibility detection decides whether each point is visible. The point set consisting of all visible sphere-surface points is used to represent the human body model. This is the point cloud human body model with a visible spherical distribution constraint. With this model, the shape of the human body can be controlled conveniently by changing the parameters of the sphere definitions, and the pose of the human body can be represented accurately by optimizing the match with the input point cloud.
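The sampling of formula (3) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation of the invention: the function names are hypothetical, the denominator in the expression for z is taken to be the number of samples on the sphere, and visibility is approximated by a simple front-facing test with the camera at the origin, which ignores occlusion between different spheres.

```python
import numpy as np

PHI = (np.sqrt(5.0) - 1.0) / 2.0  # golden section ratio, about 0.618


def fibonacci_directions(q):
    """Return q unit direction vectors d^j, j = 1..q, following formula (3)."""
    j = np.arange(1, q + 1)
    z = (2.0 * j - 1.0) / q - 1.0  # assumption: denominator = number of samples
    rho = np.sqrt(1.0 - z ** 2)
    theta = 2.0 * np.pi * j * PHI
    return np.stack([rho * np.cos(theta), rho * np.sin(theta), z], axis=1)


def sample_sphere(center, radius, q):
    """Surface points c + r * d^j of one sphere."""
    return np.asarray(center, dtype=float) + radius * fibonacci_directions(q)


def front_facing(points, directions):
    """Assumed visibility test: the outward normal must face a camera at the origin."""
    return np.einsum("ij,ij->i", points, directions) < 0.0


if __name__ == "__main__":
    dirs = fibonacci_directions(200)
    pts = sample_sphere([0.0, 0.0, 2.0], 0.1, 200)
    mask = front_facing(pts, dirs)
    print("visible points:", int(mask.sum()), "of", len(pts))
```

A full visibility check would also remove points hidden behind other spheres, for example with a depth buffer rendered from the camera.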

2. Matching and optimizing between the human body model and the depth point cloud for human body pose tracking

The depth point cloud P transformed from the depth map is sampled to obtain $\overline{P}$. Assuming that the model and the depth point cloud are in the same camera coordinate system, the camera corresponding to the depth point cloud is used to constrain the angle of view, and the intersecting and occluded parts are removed to retain the visible points $\overline{V}$ on the model under the current angle of view. These points represent the model in the current pose. A Euclidean distance measure is used to find, for each point of $\overline{P}$, the corresponding point on $\overline{V}$; the set of corresponding points redefines the model as $\overline{\overline{V}}$:

$\begin{matrix}{{\overset{\_}{\overset{\_}{V}} = {\underset{k = 1}{\bigcup\limits^{11}}{\underset{i = 1}{\bigcup\limits^{N_{k}}}{\underset{j = 1}{\bigcup\limits^{Q_{k,i}}}\left\{ {c_{i}^{k} + {r_{i}^{k}d_{k,i}^{j}}} \right\}}}}}{\overset{\_}{P} = {\underset{k = 1}{\bigcup\limits^{11}}{\underset{i = 1}{\bigcup\limits^{N_{k}}}{\underset{j = 1}{\bigcup\limits^{Q_{k,i}}}{{\overset{\_}{p}}_{k,i}^{j}.}}}}}} & (4)\end{matrix}$
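The correspondence step can be illustrated with a brute-force nearest-neighbour search under the Euclidean distance. The sketch below is only an illustration: the function name is hypothetical, and a practical system would likely use a spatial index such as a k-d tree instead of the full distance matrix.

```python
import numpy as np


def nearest_model_points(depth_pts, model_pts):
    """For each sampled depth point, find the closest visible model point.

    depth_pts: (n, 3) array, the sampled depth point cloud.
    model_pts: (m, 3) array, the visible model points.
    Returns the index of the matched model point and the matching distance.
    """
    diff = depth_pts[:, None, :] - model_pts[None, :, :]
    dist = np.linalg.norm(diff, axis=2)  # (n, m) Euclidean distances
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(depth_pts)), idx]


if __name__ == "__main__":
    model = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
    depth = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.0]])
    idx, d = nearest_model_points(depth, model)
    print(idx, d)
```

The matched model points `model[idx]` play the role of the redefined point set in formula (4), paired one to one with the sampled depth points.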

After the correspondence between $\overline{P}$ and $\overline{\overline{V}}$ is established, the movement of the human body is regarded as a series of simultaneous and slow movements of all body components. The matching and optimization of the model and the point cloud is therefore converted into solving for a global translation t and a series of local rotations R_(k). The cost function is formula (5):

$$\min_{R_{k},\,t} \sum_{k=1}^{11} \left( \lambda\,\Psi_{corr}\left( R_{k},t \right) + \Psi_{joint}\left( R_{k},t \right) + \mu_{k}\,\Psi_{regu}\left( R_{k} \right) \right) \quad \text{s.t.}\ \left( R_{k} \right)^{T} R_{k} = I \qquad (5)$$

where λ, μ_(k)>0 are weight parameters. The first term Ψ_(corr) penalizes the distance between the model surface points and the input depth point cloud:

$$\Psi_{corr}\left( R_{k},t \right) = \sum_{i=1}^{N_{k}} \sum_{j=1}^{\overline{Q}_{k,i}} \Big\| \underbrace{c_{parent}^{k} + R_{k}\left( c_{i}^{k} - c_{parent}^{k} + r_{i}^{k} d_{k,i}^{j} \right) + t}_{\text{points of VHM after rotation and translation}} - \overline{p}_{k,i}^{j} \Big\|^{2}$$

where c_(parent) ^(k) represents the center coordinate of the parent node of the kth component. Under this constraint, each point of the model is forced closer to its corresponding point in the point cloud after rotation and translation. The second term Ψ_(joint), formula (6), uses the joint positions and joint directions of the previous frame as special marker information to restrict excessive spatial movement and rotation between two frames, and to reduce the difference between the two frames to a certain extent:

$$\Psi_{joint}\left( R_{k},t \right) = \sum_{m=1}^{M_{k}} \left( \alpha_{k,m} \left\| j_{k,m} + t - j_{k,m}^{init} \right\|^{2} + \beta_{k,m} \left\| R_{k} n_{k,m} - n_{k,m}^{init} \right\|^{2} \right) \qquad (6)$$

where j_(k,m) and j_(k,m) ^(init) represent the position of the mth joint of the kth component under the current pose and the initial pose, respectively, and n_(k,m) and n_(k,m) ^(init) represent the direction between the mth joint and its parent joint under the current pose and the initial pose, respectively. The weights α_(k,m), β_(k,m), which balance the correspondence term against the joint location term, are given by formula (7):

$$\alpha_{k,m} = \frac{\tau^{k}}{1 + e^{-\left( \left\| j_{k,m} - j_{k,m}^{init} \right\| - \omega_{2} \right)}}, \qquad \beta_{k,m} = \frac{\gamma^{k}}{1 + e^{-\left( \arccos\left( n_{k,m}^{T} n_{k,m}^{init} \right) - \omega_{3} \right)}} \qquad (7)$$

where ω₂, ω₃>0 are weight parameters for controlling the range of error, and τ^(k), γ^(k) are scaling parameters defined by:

$$\tau^{k} = \frac{\mu_{1}}{1 + e^{\left( Dist\left( \overline{P}^{k},\overline{\overline{V}}^{k} \right) - \omega_{1} \right)}}, \qquad \gamma^{k} = \frac{\mu_{2}}{1 + e^{\left( Dist\left( \overline{P}^{k},\overline{\overline{V}}^{k} \right) - \omega_{1} \right)}}, \qquad Dist\left( \overline{P}^{k},\overline{\overline{V}}^{k} \right) = \frac{1}{\left| \overline{P}^{k} \right|} \sum_{i=1}^{N_{k}} \sum_{j=1}^{Q_{k,i}} \left\| c_{i}^{k} + r_{i}^{k} d_{k,i}^{j} - \overline{p}_{k,i}^{j} \right\| \qquad (8)$$

where Dist($\overline{P}^{k}$, $\overline{\overline{V}}^{k}$) represents the average distance between corresponding points of $\overline{P}^{k}$ and $\overline{\overline{V}}^{k}$, and ω₁>0 determines the distance error threshold. τ^(k) and γ^(k) are solved only once, before optimization and after the first correspondence is determined, and remain unchanged during the iterations. α_(k,m) and β_(k,m) are updated whenever the correspondence is updated.
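The weights of formulas (7) and (8) are logistic functions of a displacement, an angle or a mean distance relative to a threshold. The sketch below shows that shape only; the function names are hypothetical and every threshold is passed in as a plain argument named `omega`.

```python
import numpy as np


def logistic(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def scaling(mu, mean_dist, omega):
    """tau^k or gamma^k: mu / (1 + exp(Dist - omega)); shrinks as the fit gets worse."""
    return mu * logistic(-(mean_dist - omega))


def alpha(tau_k, j_cur, j_init, omega):
    """Position weight: grows with the joint displacement beyond the threshold."""
    return tau_k * logistic(np.linalg.norm(j_cur - j_init) - omega)


def beta(gamma_k, n_cur, n_init, omega):
    """Direction weight: grows with the angle between the two unit directions."""
    cos_angle = np.clip(np.dot(n_cur, n_init), -1.0, 1.0)  # guard against rounding
    return gamma_k * logistic(np.arccos(cos_angle) - omega)


if __name__ == "__main__":
    tau = scaling(mu=1.0, mean_dist=0.05, omega=0.1)
    print(tau, alpha(tau, np.array([0.2, 0.0, 0.0]), np.zeros(3), omega=0.1))
```

With this shape, a joint that has moved far from its previous position is pulled back more strongly, while a component whose points already fit the depth data poorly receives a smaller scaling, so the correspondence term dominates there.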

The third term Ψ_(regu), formula (9), constrains large rotations of each part during the iterations, since the motion between two adjacent frames is regarded as a process in which all parts change simultaneously:

$$\Psi_{regu}\left( R_{k} \right) = \left\| R_{k} - I \right\|^{2} \qquad (9)$$
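To make the three terms of the cost function concrete, the sketch below evaluates them for a single component. It only evaluates the cost for given R_(k) and t and does not solve the minimization; the function names are hypothetical, the norm in the regularization term is assumed to be the Frobenius norm, and the weights α, β are assumed to be precomputed.

```python
import numpy as np


def psi_corr(R, t, c_parent, model_pts, depth_pts):
    """Squared distances between transformed model points and matched depth points.

    model_pts holds the points c_i + r_i * d_j; the rotation acts about c_parent.
    """
    moved = c_parent + (model_pts - c_parent) @ R.T + t
    return float(np.sum((moved - depth_pts) ** 2))


def psi_joint(R, t, joints, joints_init, dirs, dirs_init, alpha, beta):
    """Weighted joint position and joint direction terms of formula (6)."""
    pos = np.sum((joints + t - joints_init) ** 2, axis=1)
    ang = np.sum((dirs @ R.T - dirs_init) ** 2, axis=1)
    return float(np.sum(alpha * pos + beta * ang))


def psi_regu(R):
    """Assumed squared Frobenius norm of R - I, formula (9)."""
    return float(np.sum((R - np.eye(3)) ** 2))


def component_cost(R, t, lam, mu_k, corr_args, joint_args):
    """One summand of formula (5) for a single component."""
    return (lam * psi_corr(R, t, *corr_args)
            + psi_joint(R, t, *joint_args)
            + mu_k * psi_regu(R))


if __name__ == "__main__":
    R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    print(psi_regu(R))
```

Summing `component_cost` over the 11 components gives the objective of formula (5); the orthogonality constraint on each R_(k) has to be handled by whatever solver is used.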

3. Recovering from pose tracking errors based on dynamic database retrieval

Since the invention is an unsupervised pose estimation method, a pose recovery operation is required when a tracking error occurs. The overlap rate θ_(overlap) between the input depth point cloud and the constructed human body model on the two-dimensional plane, together with the cost function value θ_(cost), is used to determine whether the current tracking has failed. Assuming that human limb motion segments are repetitive over time, a pose tracking recovery method based on a small dynamic database is proposed. The direction information of each body part is used to represent the three-dimensional human motion. As shown in FIG. 3, the upper and lower trunk parts are simplified into two mutually perpendicular main directions, each limb part is represented by a direction vector, and the direction of the head is ignored. This is expressed as formula (10):

$$v = \left( v_{1}^{T}, \ldots, v_{10}^{T} \right)^{T} \qquad (10)$$

where v₁ and v₂ are the mutually perpendicular unit directions of the upper torso and the lower torso, respectively, and v₃, . . . , v₁₀ are the unit directions of all components other than the upper torso, the lower torso and the head.

As shown in FIG. 4, PCA is used to extract the main directions [e₁, e₂, e₃] of the depth point cloud, and the minimum bounding box [w,d,h] along the main directions is used to represent the characteristics of the depth point cloud, which is formula (11):

$$e = \left( w e_{1}^{T}, d e_{2}^{T}, h e_{3}^{T} \right)^{T} \qquad (11)$$
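The feature of formula (11) can be sketched with an eigendecomposition of the point cloud covariance. This is an illustrative sketch with assumptions: the function name is hypothetical, the extents w, d, h are paired with the main directions in order of decreasing variance, and each axis is flipped so that its largest-magnitude component is positive, a sign convention added here because PCA axes are only defined up to sign.

```python
import numpy as np


def pca_box_feature(points):
    """Return the 9-vector (w*e1, d*e2, h*e3) of a point cloud, formula (11)."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)
    _, eigvec = np.linalg.eigh(cov)  # eigenvalues in ascending order
    axes = eigvec[:, ::-1].T.copy()  # rows e1, e2, e3 by decreasing variance
    for a in axes:  # assumed sign convention, not part of the source
        if a[np.argmax(np.abs(a))] < 0.0:
            a *= -1.0
    proj = centered @ axes.T
    extents = proj.max(axis=0) - proj.min(axis=0)  # box size along each axis
    return (extents[:, None] * axes).reshape(-1)


if __name__ == "__main__":
    x, y, z = np.meshgrid(np.linspace(-2, 2, 9), np.linspace(-1, 1, 5),
                          np.linspace(-0.5, 0.5, 3), indexing="ij")
    cloud = np.stack([x.ravel(), y.ravel(), z.ravel()], axis=1)
    print(pca_box_feature(cloud).round(3))
```

Without a sign convention, two nearly identical clouds could yield features that differ by a flipped axis, which would distort a Euclidean comparison of the features.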

If, during tracking, the matching satisfies the thresholds θ_(overlap)≤θ₁ and θ_(cost)≤θ₂, the tracking is regarded as successful and the database model D is updated with the extracted features [e,v], which are saved in the database as a pair of feature vectors. When the tracking fails, the Euclidean distance between the feature e of the current depth point cloud and the features e stored in the database is calculated, the five entries {[e^((i)), v^((i))]}_(i=1) ⁵ with the smallest distance are found in the database, and the pose with the highest overlap rate with the current input depth point cloud is retrieved by using v^((i)), i=1, . . . , 5, to recover the point cloud human body model with a visible spherical distribution constraint and thereby recover from the tracking failure.
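The database update and retrieval described above can be sketched as follows. The class `PoseDatabase` and the callback `overlap_fn` are hypothetical: `overlap_fn` stands for whatever routine rebuilds the model from a stored direction feature v and scores its overlap with the current depth point cloud, which is not specified here.

```python
import numpy as np


class PoseDatabase:
    """Small dynamic store of feature pairs [e, v] from successfully tracked frames."""

    def __init__(self):
        self.e = []
        self.v = []

    def add(self, e, v):
        """Save one feature pair after a successful tracking step."""
        self.e.append(np.asarray(e, dtype=float))
        self.v.append(np.asarray(v, dtype=float))

    def candidates(self, e_query, k=5):
        """Return the k stored pairs whose e is closest to e_query (Euclidean)."""
        if not self.e:
            return []
        dist = np.linalg.norm(np.stack(self.e) - e_query, axis=1)
        order = np.argsort(dist)[:k]
        return [(self.e[i], self.v[i]) for i in order]

    def recover(self, e_query, overlap_fn, k=5):
        """Among the k candidates, return the v with the highest overlap score."""
        cands = self.candidates(e_query, k)
        if not cands:
            return None
        return max(cands, key=lambda pair: overlap_fn(pair[1]))[1]


if __name__ == "__main__":
    db = PoseDatabase()
    for i in range(7):
        db.add(np.full(9, float(i)), np.full(30, float(i)))
    # Toy overlap score that prefers the stored pose with value 3.
    best = db.recover(np.zeros(9), lambda v: -abs(v[0] - 3.0))
    print(best[0])
```

The retrieved v would then be used to reinitialize the model before matching resumes.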

The invention has been verified on the public SMMC and PDT data sets, and good experimental results have been obtained. FIG. 5 shows the average error of the invention on the SMMC data set. The motions in the SMMC data set are relatively simple, and the results of the invention are comparable to those of the best current method. FIG. 6 shows the average error of the invention on the PDT data set. The motions in the PDT data set are complex and challenging, but the method of the invention still achieves good results. Table 1 compares the efficiency of the invention with other similar methods on the PDT and SMMC databases. Unlike the other real-time methods, the invention reaches real-time average speed without GPU acceleration. FIG. 7 shows the subjective results for some complex postures on the PDT data set; the experimental results show that the algorithm still achieves good estimation results for complex actions.

TABLE 1

algorithm            real-time (Y/N)   GPU (Y/N)
Ding&Fan             N                 N
Ye&Yang              Y                 Y
Vasileiadis et al    Y                 Y
The invention        Y                 N

The above contents are only the preferable embodiments of the present invention, and do not limit the present invention in any manner. Any improvements, amendments and alternative changes made to the above embodiments according to the technical spirit of the present invention shall fall within the claimed scope of the present invention.

1. A method for three-dimensional human pose estimation, including the following steps: (1) establishing a three-dimensional human body model matching the object, the model being a point cloud human body model with a visible spherical distribution constraint; (2) matching and optimizing between the human body model and the depth point cloud for human body pose tracking; and (3) recovering from pose tracking errors based on dynamic database retrieval.
2. The method for three-dimensional human pose estimation according to claim 1, wherein in step (1) the human body surface is represented by a set of 57 spheres. Each sphere is characterized by a radius and a center, which are initialized empirically. By assigning all the spheres to 11 body components, we define the sphere set S as the collection of 11 component sphere-set models, each of which represents a body component. That is, $$S = \bigcup_{k=1}^{11} S^{k}, \qquad S^{k} = \left\{ g_{i}^{k} \right\}_{i=1}^{N_{k}} := \left\{ \left[ c_{i}^{k}, r_{i}^{k} \right] \right\}_{i=1}^{N_{k}} \qquad (1)$$ where c_(i) ^(k) and r_(i) ^(k) represent the center and the radius of the ith sphere in the kth component, respectively, and N_(k) represents the number of spheres contained in the kth component, with ${\sum\limits_{k = 1}^{11} N_{k}} = 57.$
3. The method for three-dimensional human pose estimation according to claim 2, wherein in step (1) wrist and ankle movements are ignored.
 4. The method for three-dimensional human pose estimation according to the claim 3, in step (1), for all 57 spheres, we construct a directed tree, each node of which corresponds to a sphere. The root of the tree is g₁ ¹, and each of the other nodes has a unique parent node which is denoted by a black sphere. The definition of the parent nodes is given by: parent(S ¹)=g ₁ ¹,parent(S ²)=g ₁ ¹,parent(S ³)=g ₂ ²,parent(S ⁴)=g ₁ ³,parent(S ⁵)=g ₁ ²,parent(S ⁶)=g ₁ ⁵,parent(S ⁷)=g ₂ ²,parent(S ⁸)=g ₂ ¹,parent(S ⁹)=g ₁ ⁸,parent(S ¹⁰)=g ₂ ¹,parent(S ¹¹)=g ₁ ¹⁰  (2) Based on this definition, the motion of each body part is considered to be determined by the rotation motion R_(k) in the local coordinate system with its parent node as the origin plus the global translation vector t in the world coordinate system. Using Fibonacci spherical algorithm to get spherical point cloud by dense sampling, a cloud point human body model of visible spherical distribution constraint is the formula (3): $\begin{matrix} {{V = {{\underset{k = 1}{\bigcup\limits^{11}}v^{k}}:={\underset{k = 1}{\bigcup\limits^{11}}{\underset{i = 1}{\bigcup\limits^{N_{k}}}{\underset{j = 1}{\bigcup\limits^{Q_{k,i}}}\left\{ {c_{i}^{k} + {r_{i}^{k}d_{k,i}^{j}}} \right\}}}}}}{d_{k,i}^{j} = \left\lbrack {x_{k,i}^{j},y_{k,i}^{j},z_{k,i}^{j}} \right\rbrack^{r}}{x_{k,i}^{j} = {\sqrt{1 - \left( z_{k,i}^{j} \right)^{2}} \cdot {\cos \left( {2\; \pi \; j\; \varphi} \right)}}}{y_{k,i}^{j} = {\sqrt{1 - \left( z_{k,i}^{j} \right)^{2}} \cdot {\sin \left( {2\; \pi \; j\; \varphi} \right)}}}\; {z_{k,i}^{j} = {{\left( {{2j} - 1} \right)/N_{i}} - 1}}} & (3) \end{matrix}$ Where Q_(k,i) denotes the number of sampling points of the ith sphere of the kth component, and ϕ≈0.618 is the golden section ratio. For example, d_(k,i) ^(j) denotes the direction vector of the jth sampling point of the ith sphere of the kth component. 
Each point is therefore assigned a visibility attribute, which is determined by the observation coordinate system of the point cloud, and whether each point is visible is decided by visibility detection. A point set consisting of all visible sphere points is used to represent the human body model. This point set is the cloud point human body model of visible spherical distribution constraint.
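By way of non-limiting illustration, the Fibonacci sampling of formula (3) may be sketched in Python as follows. The sketch assumes that the denominator in the expression for z equals the number of sampling points on the sphere (here `q`). The function names, the camera position, and the `front_facing` test are hypothetical: the test only checks whether a sphere point faces the camera and does not handle occlusion between spheres, so it is not the visibility detection of the method itself.

```python
import math

# Golden section ratio, approximately 0.618, as used in formula (3).
PHI = (math.sqrt(5.0) - 1.0) / 2.0


def fibonacci_directions(q):
    """Return q unit direction vectors d^j, j = 1..q, following formula (3).

    Assumption: the denominator in z is the number of samples q.
    """
    directions = []
    for j in range(1, q + 1):
        z = (2 * j - 1) / q - 1.0
        rho = math.sqrt(max(0.0, 1.0 - z * z))
        angle = 2.0 * math.pi * j * PHI
        directions.append((rho * math.cos(angle), rho * math.sin(angle), z))
    return directions


def sample_sphere(center, radius, q):
    """Return the surface points c + r * d^j of one sphere."""
    return [
        tuple(c + radius * d for c, d in zip(center, direction))
        for direction in fibonacci_directions(q)
    ]


def front_facing(center, point, camera=(0.0, 0.0, 0.0)):
    """Hypothetical visibility test: True if the outward normal at the
    point has a positive component toward the camera position."""
    normal = [p - c for p, c in zip(point, center)]
    to_camera = [cam - p for cam, p in zip(camera, point)]
    return sum(n * t for n, t in zip(normal, to_camera)) > 0.0
```

For a sphere centred in front of the camera, roughly half of the sampled points pass the `front_facing` test; a full implementation would additionally remove points hidden by other spheres.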
 5. The method for three-dimensional human pose estimation according to claim 4, wherein, in step (2), the depth point cloud $P$ transformed from the depth map is sampled to obtain $\bar{P}$. Assuming that both the model and the depth point cloud are in the same camera coordinate system, the camera corresponding to the depth point cloud is used to constrain the angle of view, and the intersecting part and the occluded part are removed to retain the visible points $\bar{V}$ on the model under the current angle of view. These points represent the model in the current pose. The Euclidean distance measure is used to obtain, for $\bar{P}$, the corresponding points $\bar{\bar{V}}$ on $\bar{V}$, and the corresponding point sets are redefined as formula (4):
$$\begin{aligned}
\bar{\bar{V}} &= \bigcup_{k=1}^{11} \bigcup_{i=1}^{N_k} \bigcup_{j=1}^{Q_{k,i}} \left\{ c_i^k + r_i^k d_{k,i}^{j} \right\}\\
\bar{P} &= \bigcup_{k=1}^{11} \bigcup_{i=1}^{N_k} \bigcup_{j=1}^{Q_{k,i}} \bar{p}_{k,i}^{j}
\end{aligned} \tag{4}$$
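By way of non-limiting illustration, the Euclidean correspondence search of step (2) may be sketched in Python as follows. The sketch assumes a brute-force nearest-neighbour search and represents points as plain 3-tuples; the function name `match_points` is hypothetical, and a practical implementation would likely use a spatial index for speed.

```python
import math


def match_points(depth_points, model_points):
    """For each sampled depth point, find the nearest visible model point.

    Returns a list of (depth_point, model_point) pairs, one per depth
    point, using the Euclidean distance as the measure.
    """
    pairs = []
    for p in depth_points:
        nearest = min(model_points, key=lambda v: math.dist(p, v))
        pairs.append((p, nearest))
    return pairs
```

The returned pairs play the role of the corresponding point sets in formula (4): the first elements form the sampled depth points and the second elements the matched model points.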
 6. The method for three-dimensional human pose estimation according to claim 5, wherein, in step (2), after the correspondence between $\bar{P}$ and $\bar{\bar{V}}$ is established, the movement of the human body is regarded as a series of simultaneous and slow movements of all body components. Therefore, the matching and optimization of the model and the point cloud is converted into solving for a global translation $t$ and a series of local rotations $R_k$. The cost function is formula (5):
$$\begin{aligned}
&\min_{R_k,\, t} \sum_{k=1}^{11} \left( \lambda \Psi_{corr}(R_k, t) + \Psi_{joint}(R_k, t) + \mu_k \Psi_{regu}(R_k) \right)\\
&\text{s.t.}\ (R_k)^{T} R_k = I
\end{aligned} \tag{5}$$
where $\lambda, \mu_k > 0$ are weight parameters. The first term $\Psi_{corr}$ penalizes the distance between the model surface points and the input depth point cloud:
$$\Psi_{corr}(R_k, t) = \sum_{i=1}^{N_k} \sum_{j=1}^{\bar{Q}_{k,i}} \Big\| \underbrace{c_{parent}^{k} + R_k \left( c_i^k - c_{parent}^{k} + r_i^k d_{k,i}^{j} \right) + t}_{\text{points of VHM after rotation and translation}} - \bar{p}_{k,i}^{j} \Big\|^{2}$$
where $c_{parent}^{k}$ represents the center coordinate of the parent node of the $k$th component. Based on this constraint, each point of the model is forced to lie closer to its corresponding point in the point cloud after rotation and translation.
The second term $\Psi_{joint}$ is formula (6). It uses the joint position information and position direction information of the previous frame as special marker information to restrict excessive spatial movement and rotation between the two frames, and to reduce the difference between the two frames to a certain extent:
$$\Psi_{joint}(R_k, t) = \sum_{m=1}^{M_k} \left( \alpha_{k,m} \left\| j_{k,m} + t - j_{k,m}^{init} \right\|^{2} + \beta_{k,m} \left\| R_k n_{k,m} - n_{k,m}^{init} \right\|^{2} \right) \tag{6}$$
where $j_{k,m}$, $j_{k,m}^{init}$ represent the position of the $m$th joint of the $k$th component under the current pose and the initial pose, respectively, and $n_{k,m}$, $n_{k,m}^{init}$ represent the direction between the $m$th joint and its parent joint under the current pose and the initial pose, respectively. The weights $\alpha_{k,m}$, $\beta_{k,m}$ for balancing the correspondence term and the location term are given by formula (7):
$$\begin{aligned}
\alpha_{k,m} &= \frac{\tau^{k}}{1 + e^{-\left( \left\| j_{k,m} - j_{k,m}^{init} \right\| - \omega_2 \right)}}\\
\beta_{k,m} &= \frac{\gamma^{k}}{1 + e^{-\left( \arccos\left( n_{k,m}^{T} n_{k,m}^{init} \right) - \omega_3 \right)}}
\end{aligned} \tag{7}$$
where $\omega_2, \omega_3 > 0$ are weight parameters for controlling the range of error, and $\tau^{k}$, $\gamma^{k}$ are scaling parameters defined by formula (8):
$$\begin{aligned}
\tau^{k} &= \frac{\mu_1}{1 + e^{\left( \mathrm{Dist}\left( \bar{P}^{k}, \bar{\bar{V}}^{k} \right) - \omega_1 \right)}}\\
\gamma^{k} &= \frac{\mu_2}{1 + e^{\left( \mathrm{Dist}\left( \bar{P}^{k}, \bar{\bar{V}}^{k} \right) - \omega_1 \right)}}\\
\mathrm{Dist}\left( \bar{P}^{k}, \bar{\bar{V}}^{k} \right) &= \frac{1}{\left| \bar{P}^{k} \right|} \sum_{k=1}^{11} \sum_{i=1}^{N_k} \sum_{j=1}^{Q_{k,i}} \left\| c_i^k + r_i^k d_{k,i}^{j} - \bar{p}_{k,i}^{j} \right\|
\end{aligned} \tag{8}$$
where $\mathrm{Dist}(\bar{P}^{k}, \bar{\bar{V}}^{k})$ represents the average distance between corresponding points of $\bar{P}^{k}$ and $\bar{\bar{V}}^{k}$, and $\omega_1 > 0$ is used to determine the distance error threshold.
$\tau^{k}$, $\gamma^{k}$ are solved only once, before optimization and after the first correspondence is determined, and remain unchanged during the iterative process, whereas $\alpha_{k,m}$, $\beta_{k,m}$ are updated whenever the correspondence is updated. The third term $\Psi_{regu}$ is formula (9). It constrains large rotations of each part in the iterative process, the motion between two adjacent frames being regarded as a process of simultaneous change of each part:
$$\Psi_{regu}(R_k) = \left\| R_k - I \right\|^{2} \tag{9}$$
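By way of non-limiting illustration, the weights of formulas (7) and (8) and the regularization term of formula (9) may be sketched in Python as follows. The sketch assumes that $\omega_1$ is the threshold in the scaling parameters, $\omega_2$ the threshold in $\alpha_{k,m}$, and $\omega_3$ the threshold in $\beta_{k,m}$, as the accompanying definitions suggest, and that the norm in formula (9) is the Frobenius norm. All function names are hypothetical.

```python
import math


def scaling_parameter(mu, avg_dist, omega1):
    """tau^k or gamma^k of formula (8): decreases toward 0 as the average
    distance between corresponding points exceeds the threshold omega1."""
    return mu / (1.0 + math.exp(avg_dist - omega1))


def alpha_weight(tau, joint, joint_init, omega2):
    """alpha_{k,m} of formula (7): grows with the joint displacement."""
    shift = math.dist(joint, joint_init)
    return tau / (1.0 + math.exp(-(shift - omega2)))


def beta_weight(gamma, n, n_init, omega3):
    """beta_{k,m} of formula (7): grows with the angle between the unit
    directions n and n_init (assumed threshold: omega3)."""
    cos_angle = sum(a * b for a, b in zip(n, n_init))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return gamma / (1.0 + math.exp(-(math.acos(cos_angle) - omega3)))


def psi_regu(rotation):
    """Formula (9): squared Frobenius norm of R_k - I for a 3x3 matrix
    given as a list of three rows."""
    return sum(
        (rotation[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(3)
        for j in range(3)
    )
```

For example, a rotation of 90 degrees about the z axis gives `psi_regu` a value of 4, while the identity rotation gives 0, so the term penalizes large per-frame rotations.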
 7. The method for three-dimensional human pose estimation according to claim 6, wherein, in step (3), the overlap rate $\theta_{overlap}$ of the input depth point cloud and the constructed human body model on the two-dimensional plane, together with the cost function value $\theta_{cost}$, is used to determine whether the current tracking fails. Assuming that human limb motion segments are repetitive in time series, the direction information of each body part is used to represent the three-dimensional human motion: the upper and lower trunk parts are simplified into two mutually perpendicular main directions, each part of the limbs is represented by a direction vector, and the direction of the head is ignored. This is expressed as formula (10):
$$v = \left( v_1^{T}, \ldots, v_{10}^{T} \right)^{T} \tag{10}$$
where $v_1$, $v_2$ correspond to the pairwise perpendicular unit directions of the upper torso and the lower torso, respectively, and $v_3, \ldots, v_{10}$ correspond to the unit directions of all components except the upper torso, the lower torso and the head.
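By way of non-limiting illustration, the stacking of the ten unit directions into the vector of formula (10) may be sketched in Python as follows. The function names are hypothetical, and the sketch only normalizes and concatenates the given directions; it does not check that the two torso directions are perpendicular.

```python
import math


def unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in vec))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return tuple(c / norm for c in vec)


def pose_descriptor(directions):
    """Stack ten 3D part directions into one 30-dimensional vector v."""
    if len(directions) != 10:
        raise ValueError("expected ten part directions")
    descriptor = []
    for direction in directions:
        descriptor.extend(unit(direction))
    return descriptor
```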
 8. The method for three-dimensional human pose estimation according to claim 7, wherein, in step (3), PCA is used to extract the main directions $[e_1, e_2, e_3]$ of the depth point cloud, and the minimum bounding box $[w, d, h]$ along the main directions is used to represent the characteristics of the depth point cloud, which is formula (11):
$$e = \left( w e_1^{T},\, d e_2^{T},\, h e_3^{T} \right)^{T} \tag{11}$$
If, in the tracking process, the matching satisfies the thresholds $\theta_{overlap} \le \theta_1$ and $\theta_{cost} \le \theta_2$, the tracking is successful and the database model $D$ is updated by extracting the features $[e, v]$, which are saved in the database as a pair of characteristic vectors. When the tracking fails, the Euclidean distance is calculated using the characteristics $e$ of the corresponding depth point clouds in the database, the first five positions $\{[e^{(i)}, v^{(i)}]\}_{i=1}^{5}$ with the smallest distance are found in the database, and the position with the highest overlap rate with the current input depth point cloud is retrieved by using $v^{(i)}$, $i = 1, \ldots, 5$, to recover the cloud point human body model of visible spherical distribution constraint, so as to facilitate recovery from the tracking failure.
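By way of non-limiting illustration, the database retrieval used for recovery may be sketched in Python as follows. The sketch assumes that the database is a list of `(e, v)` pairs and that the overlap rate of a candidate with the current input depth point cloud is supplied by a caller-provided function; the names `retrieve_pose` and `overlap_rate` are hypothetical, and how the overlap rate is computed is outside the sketch.

```python
import math


def retrieve_pose(database, e_query, overlap_rate, k=5):
    """Return the stored direction vector v used to recover the model.

    database:     list of (e, v) feature pairs saved during successful tracking
    e_query:      feature e of the current input depth point cloud
    overlap_rate: hypothetical callable mapping a stored v to its overlap
                  rate with the current input depth point cloud
    """
    if not database:
        raise ValueError("database is empty")
    # Keep the k entries whose feature e is closest in Euclidean distance.
    candidates = sorted(database, key=lambda entry: math.dist(entry[0], e_query))[:k]
    # Among these, pick the entry with the highest overlap rate.
    best = max(candidates, key=lambda entry: overlap_rate(entry[1]))
    return best[1]
```

Restricting the overlap test to the five nearest entries keeps the comparatively expensive overlap computation off the rest of the database.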